Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restart scheduler on error #6195

Merged
merged 1 commit into from
Oct 19, 2023

Conversation

mkindahl
Copy link
Contributor

If the scheduler receives an error, it will never restart again since bgw_restart_time is set to BGW_NEVER_RESTART, which will prevent all jobs from executing.

This commit fixes the issue by adding a GUC that can be set to the restart time for the scheduler, and set the default to 30 seconds. It also adds some additional variables to be able to shutdown the scheduler with a non-zero exit code, which allows the restart functionality to be tested.

Fixes #5091

@codecov
Copy link

codecov bot commented Oct 13, 2023

Codecov Report

Merging #6195 (6791e70) into main (21a3ebd) will increase coverage by 0.03%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #6195      +/-   ##
==========================================
+ Coverage   65.04%   65.07%   +0.03%     
==========================================
  Files         246      246              
  Lines       56955    56957       +2     
  Branches    12621    12621              
==========================================
+ Hits        37045    37064      +19     
  Misses      18042    18042              
+ Partials     1868     1851      -17     
Files Coverage Δ
src/bgw/scheduler.c 82.60% <100.00%> (+3.06%) ⬆️
src/guc.c 96.77% <100.00%> (+0.03%) ⬆️
src/loader/bgw_launcher.c 84.74% <100.00%> (+1.74%) ⬆️
src/loader/loader.c 82.86% <100.00%> (ø)

... and 55 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@mkindahl mkindahl marked this pull request as ready for review October 13, 2023 12:04
@github-actions github-actions bot requested a review from antekresic October 13, 2023 12:04
@github-actions
Copy link

@antekresic, @jnidzwetzki: please review this pull request.

Powered by pull-review

@github-actions github-actions bot requested a review from jnidzwetzki October 13, 2023 12:04
@mkindahl mkindahl force-pushed the scheduler-restart-on-failure branch 4 times, most recently from c1784a0 to 08cbdd2 Compare October 16, 2023 09:58
@mkindahl mkindahl added the bug label Oct 16, 2023
@mkindahl mkindahl added this to the TimescaleDB 2.13 milestone Oct 16, 2023
@jnidzwetzki
Copy link
Contributor

@mkindahl

Have you seen the bgw_scheduler_control regression check return an error code when invoked multiple times?

test bgw_scheduler_control        ... ok        50458 ms
test bgw_scheduler_control        ... FAILED (test process exited with exit code 1)     5048 ms

@mkindahl mkindahl force-pushed the scheduler-restart-on-failure branch 2 times, most recently from b350662 to 0a8fefa Compare October 16, 2023 13:13
@mkindahl
Copy link
Contributor Author

@mkindahl

Have you seen the bgw_scheduler_control regression check return an error code when invoked multiple times?

test bgw_scheduler_control        ... ok        50458 ms
test bgw_scheduler_control        ... FAILED (test process exited with exit code 1)     5048 ms

Hmm... I though I'd fixed it.

Managed to fix it locally: let's see if it works now.

@mkindahl mkindahl force-pushed the scheduler-restart-on-failure branch 2 times, most recently from d396f4b to f9dfb12 Compare October 19, 2023 13:27
@mkindahl
Copy link
Contributor Author

@mkindahl
Have you seen the bgw_scheduler_control regression check return an error code when invoked multiple times?

test bgw_scheduler_control        ... ok        50458 ms
test bgw_scheduler_control        ... FAILED (test process exited with exit code 1)     5048 ms

Hmm... I though I'd fixed it.

Managed to fix it locally: let's see if it works now.

For some reason. stop_background_workers were not taking effect, so switched to using pg_terminate_backend. It seems stop_background_workers is killing a scheduler for a different database. Not sure why, but that looks like a bug that is unrelated to this pull request.

If the scheduler receives an error, it will never restart again since
`bgw_restart_time` is set to `BGW_NEVER_RESTART`, which will prevent
all jobs from executing.

This commit fixes the issue by adding a GUC that can be set to the
restart time for the scheduler, and set the default to 30 seconds. It
also adds some additional variables to be able to shutdown the
scheduler with a non-zero exit code, which allows the restart
functionality to be tested, as well as tests.

Fixes timescale#5091
@mkindahl mkindahl force-pushed the scheduler-restart-on-failure branch from f9dfb12 to 6791e70 Compare October 19, 2023 13:31
@mkindahl mkindahl merged commit bebd1ab into timescale:main Oct 19, 2023
36 checks passed
@mkindahl mkindahl deleted the scheduler-restart-on-failure branch October 19, 2023 20:05
@timescale-automation
Copy link

Automated backport to 2.12.x not done: cherry-pick failed.

Git status

HEAD detached at origin/2.12.x
You are currently cherry-picking commit bebd1ab42.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:
	new file:   .unreleased/feature_6195
	modified:   src/bgw/scheduler.c
	modified:   src/guc.c
	modified:   src/guc.h
	modified:   src/loader/bgw_launcher.c
	modified:   src/loader/bgw_launcher.h
	modified:   src/loader/loader.c
	new file:   tsl/test/expected/bgw_scheduler_restart.out
	new file:   tsl/test/sql/bgw_scheduler_restart.sql

Unmerged paths:
  (use "git add <file>..." to mark resolution)
	both modified:   tsl/test/sql/CMakeLists.txt


Job log

@timescale-automation timescale-automation added the auto-backport-not-done Automated backport of this PR has failed non-retriably (e.g. conflicts) label Oct 19, 2023
@mkindahl mkindahl restored the scheduler-restart-on-failure branch November 21, 2023 14:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport-not-done Automated backport of this PR has failed non-retriably (e.g. conflicts) bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug]: Automatically restart background worker scheduler
4 participants